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Abstract 

Background: The Gene Ontology (GO) is an ontology representing molecular biology concepts related to genes 
and their products. Current annotations from the GO Consortium tend to be highly specific, and contemporary 
genome-scale studies often return a long list of genes of potential interest, such as genes in a cancer tumor that 
are differentially expressed than those found in normal tissue. It is therefore a challenging task to reveal, at a 
conceptual level, the major functional themes in which genes are involved. Presently, there is a need for tools 
capable of revealing such themes through mining and representing semantic information in an objective and 
quantitative manner. 

Methods: In this study, we utilized the hierarchical organization of the GO to derive a more abstract 
representation of the major biological processes of a list of genes based on their annotations. We cast the task as 
follows: given a list of genes, identify non-disjoint, functionally coherent subsets, such that the functions of the 
genes in a subset are summarized by an informative GO term that accurately captures the semantic information of 
the original annotations. 

Results: We evaluated different metrics for assessing information loss when merging GO terms, and different 
statistical schemes to assess the functional coherence of a set of genes. We found that the best discriminative 
power was achieved by using a combination of the information-content-based measure as the information-loss 
metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph. 

Conclusions: Our methods provide an objective and quantitative approach to capturing the major directions of 
gene functions in a context-specific fashion. 



Background 

Contemporary "omics"-scaled studies often produce a 
large volume of data, where the amount of information 
overwhelms human comprehension and defies manual 
inspection. For example, the Cancer Genome Atlas 
(TCGA) [1] project characterizes thousands of tumor 
samples in terms of gene expression, genetic variations, 
and other molecular biology aspects. Since a cancer 
always results from perturbations of multiple biological 
processes [2], a whole list of differentially expressed 
genes in a tumor inevitably reflects a mixture of genes 
responding to distinct signals. It becomes an increasingly 
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important task to de-convolute the signals reflected by 
molecular data and to represent information at a concep- 
tual level by answering questions such as the following: 
"Given a list of differentially expressed genes from a 
tumor, which major biological processes are perturbed in 
the tumor?" In this study, we aim to develop informa- 
tion-theory-based and quantitative approaches to reveal 
the major functional themes among a gene list in a case- 
specific manner, and to summarize the findings using 
informative concepts. 

We formulate the task as follows: divide the genes in a 
list into non-disjoint subsets, such that genes in each 
subset participate in coherently related biological pro- 
cesses, and such that the overall functional theme of a 
subset can be represented with an informative concept in 
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a bio-ontology. The goal of finding major and informative 
biological processes requires one to strike a balance such 
that a subset should include as many genes as possible in 
order to reflect a major theme of the genes list, and yet 
the concept representing the functional theme of the 
subset should be as specific as possible. Although there is 
a large body of literature on using bio-ontologies to 
represent the information derived from genome-scale 
experiments [3-7], few studies specifically address 
the task of identifying informative concepts to represent 
the information from experimental data in a case-speci- 
fic, objective, and quantitative manner. 

Generally speaking, a method that addresses the above 
task should have the following capabilities: 1) dynami- 
cally selecting concepts that are best suited to represent 
the functional themes specific for a given gene list and 
2) assessing the suitability of a concept to represent the 
information of the genes within a subset as specifically 
as possible. One can utilize the hierarchical organization 
of well-formed bio-ontologies to achieve the first cap- 
ability. The second requirement is more challenging, in 
that it requires a quantitative measure to assess the spe- 
cificity of the concept and assess the statistical signifi- 
cance as an objective criterion to guide the search for 
the balance point between capturing the major theme of 
genes while maintaining the specificity of the theme. 

The Gene Ontology (GO) [8], a controlled vocabulary 
consisting of molecular biology terms (concepts) related to 
genes, is the most widely used bio-ontology for represent- 
ing the information derived from genome-scale experi- 
ments, particularly about aspects of the biological process 
(BP). Currently, a common approach to finding a func- 
tional theme from a gene list is to assess whether any 
GO terms are enriched among the annotations associated 
with a gene list [9-12]. However, annotations by the GO 
Consortium are usually highly specific, and it is not 
uncommon to find a set of specific GO terms enriched in 
a gene list where each term covers only a small number of 
genes, thus failing to reveal the major biological process. 
Aware of the need for more general concepts, the GO 
Consortium provides a set of general GO terms, referred 
to as GO slim [8], that represent high-level biological con- 
cepts. There are also software tools that map/associate 
genes to concepts in the GO slim subsets [13-15]. As will 
be shown, these terms tend to be too general; more impor- 
tandy, this small set of predefined GO terms may not meet 
the need of revealing functional themes in a case-specific 
fashion with balanced generality and specificity. Besides 
the GO enrichment analysis, another widely used 
approach for finding functional themes of a gene list is to 
assess whether members of certain predefined pathways, 
or gene signatures from databases such as KEGG [16] 
pathways and MSigDB [17], are enriched. However, such 
representations lack the ontological structure to support 



further reasoning by computing agents, and thus are 
unable to dynamically find the concepts best suited to 
represent the information of a gene list in a case-specific 
manner. 

In this study, we investigated a framework that utilizes 
the structure and semantic information of the GO to 
reveal major functional themes in a dynamic and gene- 
set-specific manner. We systematically studied different 
information-theory-based metrics to assess information 
loss when searching for suitable representations to sum- 
marize functional themes of gene sets. We further evalu- 
ated different statistical schemes to assess the functional 
coherence of a gene set summarized by a GO term. The 
conceptual overview of our research is shown in Figure 1. 

Methods 

Gene Ontology representation 

We developed an object-oriented Python package that 
can store and represent the following information: 1) 
the structure of the GO; 2) the association of genes to 
the GO terms; and 3) the quantitative measure of infor- 
mation loss when the information represented by a spe- 
cific concept (a GO term) is replaced by a more general 
ancestor concept. We modified and expanded a Python 
package, referred to as GOGrapher [18], previously 
developed by our group. We developed four hierarchi- 
cally organized classes to represent the graph structure 
of the GO. Each type of graph contains a particular com- 
bination of information related to GO terms and utilizes 
a different measure to quantitatively reflect the informa- 
tion difference between a pair of GO terms. The 
GOGraph simply represents the GO terms and the IS_A 
relationships among the GO terms according to the defi- 
nition of the Gene Ontology. The GOGeneGraph further 
stores information on associated gene products. The 
GOPubmedGraph stores information on associated 
PubMed records that can be used to construct the 
semantic context in which the concept represented by a 
term is discussed in biomedical literature. Finally, the 
GOGenePubmedGraph stores both types of information. 
GOGrapher package is publically available at the follow- 
ing URL: http://code.google.eom/p/gographer/. 

Data files containing the complete GO can be 
directly downloaded from the GO Consortium http:// 
www.geneontology.org. The data used to generate our 
graphs was version 1.1.1961, downloaded on June 5, 2011. 
(For the ease of presentation, we generally refer to one of 
the above types of graphs as a "GO graph," and a reader 
can easily find out which type we are referring to based on 
the context of presentation.) 

Measuring information loss 

GO nodes in the aforementioned graphs can be associated 
with genes and/or biomedical literatures; thus, each term 
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Figure 1 Conceptual Overview of Research. A-C. The ontological structure of the GO, protein annotations, and biomedical literature 
associated with genes were collected. D. The above information was combined to create an integrated graph (GOGenePubmed) that reflects 
the relationship among genes, their annotations, and the semantic relationships between GO terms. E. Based on this graph, statistical schemes 
were designed and simulation experiments were performed to establish statistical models for assessing the functional coherence of gene sets. 
F-G. When provided with a gene list from experiment (F), the program can be used to search for coherent subsets among the list (G). 



contains information with respect to its genes and seman- 
tic context. When the information represented by a speci- 
fic GO term is replaced by that of a more general term, 
information-theory-based measures can be used to assess 



how much information is lost. The graph representation 
of the GO conveniently allows us to represent such a loss 
as the weight of an edge between a parent-child pair of 
GO terms. In this study, we investigated two different 
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measures as edge weights: 1) an information bottleneck 
(IB) -based measure, which assesses the loss of informa- 
tion with respect to the semantic context of the literatures 
associated with GO terms; and 2) a measure based on the 
information content (IC) of GO terms with respect to 
genes, which assesses the loss of information related to 
genes. The calculation of edge weights was based on the 
annotation data obtained from the UniProt multispecies 
annotation file downloaded on May 10, 2011, from the 
GO Consortium. 

Information bottleneck 

To represent the semantic context when the concept of 
a GO term is discussed in literatures, we collected the 
titles and abstracts of PubMed records associated with 
each GO term. The PubMed records were preprocessed 
as described in our previous work [19], including by 
tokenization, stemming, and removing stop words and 
punctuations. Based on the IS_A relationship specified 
in the GO hierarchy, PubMed records are recursively 
propagated to ancestor nodes. To represent the semantic 
context of a GO term, we constructed a word-frequency- 
count vector, collecting tokens from the name and 
description of the GO term, and from the titles and 
abstracts of PubMed articles, to reflect the profile of 
words used to discuss the concept. 

An IB-based metric can measure the amount of infor- 
mation loss when text documents are compressed into 
clusters [20,21]. In our previous work, we extended the 
metric to reflect the information loss when the semantic 
context of a GO term, represented as a word-usage profile, 
is represented by that of its parent node [19]. The infor- 
mation loss measured between a term and its parent is: 

SI (fi) = p (t.) Djs (p (w\u) , p (w\t p )) (1) 

where 



D rs {p{w\ti),p(w\tp)) = tt.Dkl (p{w\U)\\p{w\tp)) , (3) 



Dkl(p(w\u) \\p(w\tj)) = EP(v>v\U)l°B , v (5) 

v=l P{U>v\tj) 

In the equations, Dj S represents the Jensen-Shannon 
divergence, 1 1\ is the number of descendant terms of term 
t, D KL represents the Kullback-Leibler divergence, pwdes- 
cendant terms of term t, D KL represents the Kullback- 
Leibler divergence, p(w\t) is the distribution of words 



associated with the GO term t, and V is the word vector. 
We used these formulae to calculate the information loss 
for each child term in relation to its parent. 

Information content 

When GO terms are used to annotate genes, they contain 
the information with respect to genes. The amount of 
information a term has with respect to genes is commonly 
determined with an information-theory-based quantity 
referred to as information content (IC) [22-26]. The equa- 
tion used to calculate the IC of a GO term is 

IC (t) = -In (P(t)), (6) 

where P(t) is the number of gene annotation instances 
for the term t divided by the total number of gene annota- 
tion instances in the entire GO annotation database. 
Then, the difference in IC between a pair of parent-child 
nodes can reflect the amount of information that is lost 
with respect to genes when the child concept is collapsed 
into the parent node, and is calculated with: 

dist (t p ,t c ) = | IC (t p ) -IC(t c ) | . (7) 

Identify major functional themes from a gene list 

When given a gene list, our goal is to separate the genes 
into non-disjoint subsets represented by informative GO 
terms that reflect the major functional themes. To this 
end, we developed the following procedure. 1) We repre- 
sent the GO structure using a GO graph, and associate the 
genes to the GO terms according to their annotation file. 
2) Terms without directly associated genes or any descen- 
dant with associated genes are then removed. 3) We begin 
merging leaf terms into their parent, starting with the 
terms that would result in the least information lost. If a 
term has more than one parent, it is merged into the par- 
ent that would most greatly minimize the total informa- 
tion lost. 4) After each merging, the parent term becomes 
a summarizing GO term annotating a set of genes, includ- 
ing the genes directly associated with the term, as well as 
the genes associated with its descendants. In other words, 
a summarizing GO term is annotating the genes from the 
subgraph beneath it. A merge will result in the loss of 
information with respect to genes as well as the loss of 
information with respect to the semantic context of the 
GO terms. 5) Applying the statistical analysis (to be dis- 
cussed in the following subsections), we determine the 
information loss resulting from each merging and stop the 
merging of a GO term to its parent if it would lead to a 
set of genes that is deemed functionally incoherent. 6) 
Repeat steps 1 through 5. Eventually, this procedure will 
lead to a set of GO terms annotating the genes propagated 
from its subgraph. Thus, we achieve the goal of grouping 
genes into non-disjoint subsets, and the overall functional 
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theme of each subset is represented by the final associated 
term. 

Statistical scheme for assessing the functional coherence 
of a subset 

Given a set of genes produced during the merging process 
discussed in the previous subsection, we assess whether 
the functions of the genes are coherently related. This is 
determined by measuring how closely the genes are 
related to each other within a GO graph based on their 
function annotations: the more closely they are related to 
each other, the smaller the information lost during the 
merging process-and the greater the coherence of the 
gene set. We devised two graph-based statistics schemes 
to reflect the information loss created by the merging of 
the genes: 1) the total weight of the Steiner tree connect- 
ing the genes within the subgraph beneath a summarizing 
GO term; and 2) the total weight of an augmented Steiner 
tree connecting the genes within the subgraph beneath the 
summarizing term (see below). 

Given a set of merged genes, we calculated its graph- 
based statistics using IC-based or IB-based information 
loss (see Section 2.2) as the edge weight. In the first statis- 
tics scheme, we found a Steiner tree [26,27] that con- 
nected the original GO terms annotating the genes within 
the subgraph and used the total weight of the edges as the 
coherence statistic. In the second statistical scheme, we 
introduced an edge between a pair of GO terms that anno- 
tate a common set of genes. This approach addresses the 
phenomenon that a gene can be annotated with distinct 
and yet closely related GO terms by different curators due 
to annotation inconsistency, which would artificially inflate 
the information loss reflected by the first statistic. Through 
adding an augmented edge, we aimed to collapse redun- 
dant annotations by assigning a small weight between 
them: 



di,j 



d Ps G {d G } 

u 



where i and j indicate two GO terms, d G is the set of 
edge weights for all the edges in the GOGraph G, 
dP5dP5 is the value of the weight of the 5th percentile, 
and \g it j\ is the number of genes that are shared between 
the GO terms i and /' [26]. After augmenting the graph, 
we then found a Steiner tree and used its total weight to 
reflect the total information loss. 

Statistical models for assessing coherence 

Given a graph-based statistic reflecting overall information 
loss resulting from merging a set of genes under a GO 
term, we further devised statistical models to assess the 
functional coherence of the gene set. Here, we define a set 



of genes to be functionally coherent if the information loss 
is significantly less than that derived by merging a similarly 
sized set of randomly selected genes. To this end, we 
performed a series of simulation experiments drawing ran- 
dom gene sets, and developed statistical models (which 
can also be used to assess the significance of the graph- 
based statistics for each gene) to represent the distribu- 
tions of random gene sets. 

Since the graph-based statistics for a subset of genes can 
vary according to the size of the subset, we randomly drew 
gene sets ranging from 5 to 200 genes, with varying step 
sizes. To establish a distribution of statistics for random 
sets of a given size, we repeatedly drew 100 random gene 
sets of a given size. The statistics from these random sets 
were used to develop the following statistic for assessing 
the statistical significance. 

Given a gene subset of size n and its graph-based statis- 
tic y s , we represent distribution of random gene sets of 
size n with a Gaussian distribution defined by two para- 
meters: the means {fi n ) and variances (anl). We use the 
Nadaraya- Watson non-parametric regression [26,28] to 
capture the non-linear relationship between the size n and 
the parameters of random gene sets of the same size, 
using the following equations: 



Ei=l w n Yi 



Ew w n («>) 



EU u 'n ("■) (Yi - An) 



w n (rii) = exp 



--(Hi - ny 



(9) 



(10) 



(11) 



where D is the entire dataset of calculated lengths, 
w n {rii) 

is weight for the /-th observation, and w n is 



(8) Ew w "( n *) 



the Gaussian kernel function with the bandwidth para- 
meter h [26]. The h was set to 10 for the regression cal- 
culations in this study. 

We then assess whether the statistic y s of a gene set of 
size n belongs to the population of random gene sets 
defined by \i n and cn2, and calculated a /"-value to reflect 
the significance. The /7-value for a new gene set with sta- 
tistic y n was calculated as the probability that y n belongs 
to the Gaussian distribution defined by parameters \A n 
and an2, which can be calculated using the distribution 
function as follows: 



F(yn) 



1 



'2tt 



in 

/ 



-(xn-M„r/2ff„ 2 



(12) 
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These calculations were carried out using the Python 
package rpy2 http://rpy.sourceforge.net/, which uses the 
statistical language R http://www.r-project.org/. 

Assessing discriminative power of statistical models 

To determine which combination of the aforementioned 
information loss metrics and statistical schemes is the 
best in terms of differentiating a coherent gene set from 
a random gene set, we performed a series of experiments 
using the human pathways listed in the KEGG database 
as coherent gene sets; for each pathway, an equal number 
of genes was randomly selected to use as a non-coherent 
gene set for comparison. The information loss resulting 
from merging members of a KEGG pathway or a random 
gene set was calculated using each metric and statistic 
combination, and the p-value of a gene set that belongs 
to the population of random gene sets of the same size 
was also calculated. By setting a cutoff p-value, we could 
classify a gene set as coherent or non-coherent, allowing 
us to perform ROC analysis and assess the discriminative 
power of a specific combination of information loss 
metric and graph-based statistical scheme. 

Protein-protein interaction evaluation 

We quantified the ratio of the actual protein-protein inter- 
actions (PPI) within a gene set over the maximal possible 
interactions as another measure reflecting the relatedness 
of genes within a gene set, with the expectation that a 
more coherent gene set would have a greater number of 
PPI. The PPI data was obtained from BioGrid [29], where 
we used the human data of version 3.1.85. The ratio is 
calculated using the following equation: 



where / is the number of existing interactions in the 
gene set, and g is the number of genes in the gene set. 

Results 

Characterization of different information loss measures 

We first set out to assess which of the two information- 
loss measures, the IC-based and the IB-based, best fit our 
goal of assessing information loss when genes annotated 
by highly specific GO terms were merged under a general 
GO term. We compared the distribution patterns of edge 
weights represented in different metrics to study their 
characteristics. Figure 2A shows the histograms of the 
edges conditioned on edge lengths when calculated using 
either IB-based (left panel) or IC-based information loss 
(right panel). The figure shows that the numeric scales of 
the two metrics are of different orders of magnitude. It 
also shows that the distributions are mainly dominated by 
edges with relatively short distances. This was the finding 



we anticipated, because there are more edges close to the 
leaf level in the GO hierarchy, where the differences in 
terms of semantic context or protein information are 
expected to be small. The distribution of the IB-based 
edge weights exhibits a smoother transition, whereas the 
IC-based edge weights demonstrate a peak at edge length 
zero and a quick drop afterwards, and the distribution 
contains certain spikes. 

In Figure 2B, we plot the distribution of the edge 
length conditioned on the level of the edge away from 
the root node. It shows us that the edge length 
decreases as the node level increases for both IB and IC, 
indicating the diminishing differences between a pair of 
parent-child terms when the concepts becomes more 
specific. The figures also show that the outliers domi- 
nate the distributions of the IB-based edge at multiple 
levels; in contrast, the distributions of IC-based edge 
weights reveal far fewer outliers. Thus, it is apparent 
that the distribution of IB-based edge weights has a 
much lower signal-to-noise ratio in comparison to that 
of IC-based edge weights. 

Evaluation of discriminative powers of graph-based 
statistics 

Since different information-loss measures may perform 
differently when combined with different statistical 
schemes, we tried to determine which combination of 
information-loss measure and graph-based statistical 
scheme had the highest discriminative power. To this 
end, we tested whether our model was capable of accu- 
rately differentiating between known, functionally coher- 
ent gene sets and random gene sets. Using the human 
KEGG pathways as "functionally coherent" gene sets, we 
randomly drew for each a matching gene set of the same 
size as a non-coherent one. For each KEGG pathway of 
size n with a graph-based statistic y„, we calculated the 
p-value to which the gene set belongs to the population 
of random gene sets with size n. With a p-vahxe as the 
classification threshold, the model classified a gene set as 
coherent or non-coherent. The discriminative power of 
the model was then assessed using the area of an ROC 
curve, reflecting how accurately our models distinguish 
KEGG pathways from the random gene sets. 

Figure 3 shows the results of the analyses of combin- 
ing IC-based information-loss measures with different 
graph-based statistical schemes. In Panel A (top two 
plots), we combined the use of IC as information loss 
with the length of Steiner trees derived from a GOGen- 
eGraph graph as statistic. The figure shows that this 
combination cannot separate KEGG pathway gene sets 
from the matched random and simulated random gene 
sets. Thus, the discriminative power of the model is 
poor, as shown in the ROC curve. In contrast, the com- 
bination of the use of IC as the information loss metric 
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Figure 2 Distribution of Edge weights. Both A and B are organized with the IB-based edge weight plot on the left and the IC-based edge 
weight plot on the right. A. Distribution of the shortest 90% of edges in the entire graph. B. Boxplots of the edge weight distribution organized 
according to the level of edge, where level 0 contains edges that connect to the root. 



and the length of Steiner trees derived from an augmen- 
ted GOGeneGraph as statistic exhibited significant differ- 
ences in the distribution of the data points from the 
KEGG gene sets and random gene sets. Thus, the corre- 
sponding model revealed a much higher discriminative 
power. The results of ROC analysis of all pairwise combi- 
nations of information loss metrics and graph-based statis- 
tics schemes are shown in Table 1; they indicate that the 
combination of IC and Steiner tree from an augmented 
GO graph performs the best. 

Finding multiple function aspects of KEGG pathways 

In the previous section, we used the gene sets from the 
KEGG pathway database as the surrogates of coherent 
gene sets to compare the discriminative power of statisti- 
cal models utilizing different combinations of informa- 
tion loss metrics and graph-based statistics. However, we 
noted that many KEGG pathways contain a large number 
of genes performing diverse functions, and our model 



classified them as non-coherent gene sets. Instead of sim- 
ply treating such calls by our model as errors, we further 
investigated whether it made sense to treat a large KEGG 
pathway gene set as a coherence gene set, and whether it is 
more sensible to use our approach to identify fine-grained, 
coherent gene sets from such a pathway. Figure 4 shows an 
example of one such KEGG pathway (hsa04010): the 
human MAPK signaling pathway, which includes 262 
unique genes (not all are shown in the figure). This KEGG 
pathway comprehensively includes many cellular signal 
transduction components sharing the proteins involved in 
the MAPK cascade, including growth factor signaling path- 
ways and the signaling pathway that induces apoptosis. As 
such, it may not be biologically sensible or even possible to 
find an informative concept to summarize the diverse bio- 
logical processes of genes in this KEGG database entry. 
Indeed, we tried to search for a GO term to cover all the 
genes listed in this pathway, a process which led to the 
most uninformative term of the Biological Processes 
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Figure 3 Example of distributions of graph-based statistics and the discriminative power of coherence models. A. In the scatter plot 
(top left), statistics derived from KEGG gene sets (red), the matching random gene sets (blue), and simulated random gene sets during model- 
building (green) were plotted. Top right is the corresponding ROC curve. B. Scatter plot of the graph-based statistics and ROC curve of the 
model using IC as information loss and Steiner tree length derived from augmented GO graph. 



domain, the root GO term. Therefore, it is sensible that 
our model treated the whole set of genes as not coherently 
related. When applying our method to the gene list of this 
pathway to search for coherent subsets, our model 
returned a number of non-disjoint gene subsets, reflecting 
different aspects of the functions in which these genes par- 
ticipate. For example, one subset was summarized by the 



Table 1 ROC analysis results for information loss metric 
and statistical scheme combinations. 
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combination of information loss metric and statistical scheme. 



GO term GO:0023014 (signal transduction by phosphoryla- 
tion), which included most of the genes in the pathway that 
are involved in the protein phosphorylation process, 
including MAPK kinases, and are shown as the genes in 
green. Another facet of the functions of these genes was 
summarized by the GO term GO:0006915 (apoptosis), 
which included many genes with well-known roles in the 
process of apoptosis, and are shown as the genes in blue. 
Thus, in this figure, these two concepts reflect two func- 
tional themes of the genes with a suitable level of specificity 
and generality; the rest of the concepts representing differ- 
ent functional themes are listed in Additional File 1. 

Application in real world data analysis 

We then investigated the application of our method in a 
real-world data analysis experiment to illustrate the 
challenges a bioinformatician often faces when using 
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Figure 4 An example of a KEGG pathway containing genes involved in multiple processes The "MARK signaling pathway" (hsa04010) is 
shown. Two functionally coherent subsets are highlighted. The genes summarized by GO:0023014 {Signal transduction by phosphorylation) are in 
green; the genes summarized by GO:0006915 (Apoptotic process) are in blue. Genes involved in both biological processes are in yellow. 



conventional functional analysis to detect functional 
themes from a long gene list, and to show the advan- 
tages of our approaches. 

Cancers result from genomic perturbations that lead to 
changed cellular signal pathways. A common manifesta- 
tion of a perturbation in a signaling pathway is the 
altered expression of modules of genes that carry out 
specific biological processes [30]. Another important 
characteristic of cancers is that a tumor results from a 
combination of perturbations in multiple signaling path- 
ways, thus manifesting as the perturbed expression of 
genes involved in multiple biological processes [2]. As 
such, a list of differentially expressed genes from a cancer 
sample represents a mixture of responses to different 
signal perturbations; therefore, de-convoluting the 



biological processes reflected in such a list of differen- 
tially expressed genes is a critical task. 

As a concrete example, we identified a list of genes 
from the ovarian cancer samples collected from TCGA. 
The list included a total of 837 genes, each of which was 
deferentially expressed in at least 5 tumor samples [31]. 
We first set out to evaluate whether it was suitable to use 
the original GO annotations, GO annotation enrichment 
analysis, and GO slim mapping to identify function 
themes. We found that a total of 2,175 unique GO terms 
from the BP domain were associated with the genes in 
the list, and that the median number of genes annotated 
by these GO terms was 1. The distribution of the number 
of genes annotated by these terms is shown as a box plot, 
labeled as "original", in Figure 5A. We then performed a 
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conventional hypergeometric-distribution-based GO 
term enrichment analysis to identify the "enriched" GO 
terms at a cutoff j?-value of 0.05. The analysis resulted in a 
set of 433 unique GO terms, the median number of genes 
annotated by which was 4 (see Figure 5A). Since the size 
of gene modules associated with each enriched GO term 
appeared to be too small to represent the "major" themes, 
we further studied the genes mapped to the human GO 
slim terms, finding a total of 70 human GO slim terms, 
the median number of genes mapped to which was 75. 

When applying our method to the gene list, we can use 
different /7-values as cutoffs for grouping genes into subsets, 
with a smaller p-value associated with a set whose genes are 
more coherently related to one another. Using a cutoff 
j?-value at 0.01, our method identified 70 subsets deemed to 
be functionally coherent; the median of the number of 
genes within these subgroups was 8 (see Figure 5A). When 
we relaxed the information loss requirement by setting the 
cutoff ^-values at 0.05, we identified fewer modules as 
more genes were merged into these subsets; the median 
number of genes covered by these terms was 10. 

To investigate the degree of specificity of the summariz- 
ing GO terms derived from different methods, we plotted 
their distribution based on their level, i.e., the number of 
steps away from a root a GO term was found to be (see 
Figure 5B). The results show that the GO slim terms tend 
to be very close to the root; thus, in comparison to other 
methods, these concepts tend to be very general, and lack- 
ing in specific information. The figure also shows that the 
original and "enriched" terms broadly span the GO hierar- 
chy, with more terms concentrated at very specific levels. 
Combining the information conveyed by the two panels, 



one can draw the conclusion that the original and 
"enriched" GO terms tend to be highly specific, annotating 
only a small number of genes. In comparison, the sum- 
marizing GO terms identified by our methods tend to be 
more specific than the GO slims but more general than 
the original and "enriched" annotations, with the terms 
concentrating at levels 5 and 6, and achieving a better bal- 
ance between the number of genes covered and the speci- 
ficity of the terms. 

To support the notion that the statistical model devel- 
oped by this study effectively measures the functional 
coherence, we assessed the functional relatedness of the 
proteins in a subset returned by our models using another 
measure, the within-module PPI ratio, and compared the 
results. In a series of experiments, we applied our model 
to the differentially expressed gene set using 3 different 
p-vahie cutoff thresholds (01, 0.05 and 0.01), leading to 3 
collections of modules. We then investigated whether the 
within-module PPI ratio exhibited an anti-correlation with 
^-values, based on the assumption that a more coherent 
gene set (with a smaller p-val\ie) should generate more 
within-module PPIs. Figure 6 plots the within-module PPI 
ratios of the gene modules derived using different p-vahie 
cutoff thresholds, as well as the value derived from the 
modules produced by mapping genes to the Generic GO 
slim. The results indicate that, indeed, the more stringent 
the jj-values, the higher the within-module PPI ratio; the 
genes grouped by the GO slim had the lowest within- 
module PPI ratio. Our metric agrees with another biologi- 
cally sensible metric reflecting the functional coherence of 
genes. A similar finding by Dutkowski et al. [32] corrobo- 
rates our results as well. 




Level in GO Hierarchy 

Figure 5 Distribution of genes associated with summarizing GO terms A Boxplots of the distribution of number of genes associated per 
term under five different conditions: original GO annotation; enriched GO annotations; our method with a p-value < 0.01 and 0.05 thresholds; 
and the Generic GO slim. B. Plot of the proportion of GO terms per level in the GO hierarchy under the five different conditions. 
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Figure 6 Average within-module PPI ratios for summarizing 
terms. Plots of the calculated average within-module PPI ratio for 
the summarizing terms that resulted from different thresholds of 
merging and use of the AMIGO GO Slim tool. The whisker denotes 
the calculated standard error. 



Discussion and conclusion 

The goal of this study was to develop and evaluate 
approaches that can be used to conceptualize molecular 
findings and capture the major functional themes of genes. 
Conceptualization of information from experimental data 
is a critical step toward deriving knowledge from data [33] . 
We assessed different metrics to measure information loss 
and its combination with different schemes to derive statis- 
tics reflecting the overall information loss that would result 
from grouping genes under a summarizing concept in the 
Gene Ontology. Our results indicate that the best discrimi- 
native power was achieved by combining the IC-based 
information-loss metric with the statistics derived from the 
Steiner tree connecting the genes on an augmented GO 
graph. We further demonstrate that our methods provide a 
novel approach to revealing major functional themes of a 
list of genes in an objective and quantitative fashion. 

It is worth pointing out that, while the IC-based 
metric has been widely used to reflect the "semantic" 
similarity of GO terms [22,23,34,35], the metric does 
not really reflect the semantic content, but rather repre- 
sents the amount of information a GO term reveals 
with respect to genes. As such, the difference of ICs 
between two terms reflects the difference in the infor- 
mation with respect to genes; therefore, the phrase 
"information loss with respect to genes" would be more 
appropriate terminology for the quantity. In previous 
work, we proposed to use the word-usage profile of the 
biomedical literatures in which the concept is encoded 
by a GO term to represent the semantic context of a 
GO concept; we further proposed the use of the 
IB-based metric to measure the difference in the seman- 
tic context of GO terms. Since this metric is based on 
natural language words, the tokens conveying semantic 
meanings for humans, we reasoned it should be a more 
suitable metric for assessing how closely the functions of 
genes are related to each other based on the annotation. 



The smooth distribution of edge weights measured with 
the IB-based metric seemingly supported the notion that 
the metric measures a gradual change of semantic differ- 
ence between GO terms. However, one of the premises 
for this metric to be stable is that each GO node is asso- 
ciated with a large number of documents in order to pro- 
duce a representative word-usage profile-a premise 
which was not borne out for many GO terms in our case. 
Instead, we observed a relatively small signal-to-noise 
ratio for the metric, rendering it inferior in terms of dis- 
criminating known coherent gene sets from random 
ones. A possible approach is to either collect more train- 
ing documents associated with GO terms or to design 
new metrics based on a word-usage profile that is rela- 
tively stable. Nonetheless, since the ultimate goal of the 
IC-based and IB-based metrics is to provide discrimina- 
tive power for differentiating coherent gene sets from 
non-coherent ones, our experiments indicate that the IC- 
based metric is up to the task, and that all information 
contained by a GO term is of value. 

Additional material 

' s 

Additional file 1: Gene Ontology terms summarizing the MAPK 
signaling pathway 
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